Attention Is All You Need

Transformer

Posted by Ccloud on 2023-02-02
Estimated Reading Time 5 Minutes
Words 845 In Total

This blog introduces the paper: Attention Is All You Need.

Introduction

  • Recurrent models and encoder-decoder architectures are widely used, and numerous efforts have continued to push their boundaries. However, their inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths.

  • Attention mechanisms allow modeling of dependencies without regard to their distance in the input or output sequences. Furthermore, attention mechanisms can also be used in conjunction with a recurrent network.

Background

  • To reduce sequential computation, models built from CNN blocks can compute hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows with the distance between positions: linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions.
  • In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect the authors counteract with Multi-Head Attention.
  • The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.

Model Architecture

The Transformer has an encoder-decoder structure like most competitive neural sequence transduction models. The encoder maps an input sequence of symbol representations $(x_1, \dots, x_n)$ to a sequence of continuous representations $\mathbf{z} = (z_1, \dots, z_n)$. Given $\mathbf{z}$, the decoder then generates an output sequence $(y_1, \dots, y_m)$ of symbols one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.
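As a hedged illustration of this auto-regressive loop, here is a minimal greedy-decoding sketch in Python; the encode/decode functions and the bos_id/eos_id token ids are hypothetical placeholders (the paper itself decodes with beam search):

```python
# A minimal greedy-decoding sketch. encode(), decode(), bos_id and eos_id are
# hypothetical placeholders; the paper itself decodes with beam search.
def greedy_generate(encode, decode, src_tokens, bos_id, eos_id, max_len=100):
    memory = encode(src_tokens)      # encoder output z for the source sentence
    output = [bos_id]                # decoder input starts with a begin-of-sequence token
    for _ in range(max_len):
        # decode() scores the next symbol given the encoder memory and
        # all previously generated symbols (auto-regressive conditioning).
        scores = decode(memory, output)
        next_token = int(scores.argmax())
        output.append(next_token)
        if next_token == eos_id:     # stop once the end-of-sequence token appears
            break
    return output[1:]                # drop the begin-of-sequence token
```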

(Figure: The Transformer model architecture.)

Encoder and Decoder Stacks

Encoder:

The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection around each of the two sub-layers, followed by layer normalization. That is, the output of each sub-layer is $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$, where $\mathrm{Sublayer}(x)$ is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension $d_{model} = 512$.
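A minimal PyTorch sketch of one encoder layer under these assumptions, with the post-norm residual wrapping above and nn.MultiheadAttention standing in for the attention sub-layer:

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: two sub-layers, each wrapped as LayerNorm(x + Sublayer(x))."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads,
                                               dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                       # x: (batch, seq_len, d_model)
        # Sub-layer 1: multi-head self-attention, residual connection, layer norm.
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        # Sub-layer 2: position-wise feed-forward, residual connection, layer norm.
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x
```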

Decoder:

The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.

Attention

(Figure: Scaled Dot-Product Attention (left) and Multi-Head Attention (right).)

Scaled Dot-Product Attention

(Figure: Scaled Dot-Product Attention.)

The attention function maps a query and a set of key-value pairs to an output:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

The dot products are scaled by $1/\sqrt{d_k}$ to keep them from growing large in magnitude and pushing the softmax into regions with extremely small gradients.
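A minimal NumPy sketch of this formula, with batch and head dimensions omitted for clarity:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v); mask: boolean (n_q, n_k),
    True where attention is allowed."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # scaled dot products
    if mask is not None:
        scores = np.where(mask, scores, -1e9)       # block illegal connections
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V
```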

Multi-Head Attention

In this work the authors employ h = 8 parallel attention layers, or heads, to compose a multi-head attention mechanism:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^O, \quad \mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

For each head the projected dimensions are $d_k = d_v = d_{model}/h = 64$, so the total computational cost is similar to that of single-head attention with full dimensionality.
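Continuing the NumPy sketch above, a hedged multi-head wrapper; the lists W_q, W_k, W_v and the matrix W_o stand in for the learned projections $W_i^Q$, $W_i^K$, $W_i^V$, $W^O$:

```python
def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o, h=8):
    """MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_o, where
    head_i = Attention(Q W_q[i], K W_k[i], V W_v[i]).
    W_q, W_k, W_v: lists of h matrices of shape (d_model, d_k) / (d_model, d_v);
    W_o: (h * d_v, d_model)."""
    heads = [scaled_dot_product_attention(Q @ W_q[i], K @ W_k[i], V @ W_v[i])
             for i in range(h)]
    return np.concatenate(heads, axis=-1) @ W_o
```

With $d_{model} = 512$ and h = 8, each per-head projection matrix has shape (512, 64) and W_o has shape (512, 512).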

Applications of Attention in our Model

The Transformer uses multi-head attention in three different ways:

  • In “encoder-decoder attention” layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence.
  • The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.
  • Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to $-\infty$) all values in the input of the softmax which correspond to illegal connections, as in the sketch below.
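A small NumPy sketch of such a causal mask; it can be passed as the mask argument of the scaled_dot_product_attention sketch above:

```python
import numpy as np

def causal_mask(n):
    """Boolean (n, n) mask that is True where attention is allowed:
    row i may attend to columns 0..i only, blocking attention to subsequent
    positions and preserving the auto-regressive property."""
    return np.tril(np.ones((n, n), dtype=bool))

# For n = 4, positions above the diagonal are masked out:
# [[ True False False False]
#  [ True  True False False]
#  [ True  True  True False]
#  [ True  True  True  True]]
```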

Position-wise Feed-Forward Networks

In addition to the attention sub-layers, each layer of the encoder and decoder contains a fully connected feed-forward network, applied to each position separately and identically. It consists of two linear transformations with a ReLU activation in between:

$$\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

The input and output have dimension $d_{model} = 512$, and the inner layer has dimension $d_{ff} = 2048$.
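A minimal NumPy sketch of this network; W1, b1, W2, b2 stand in for the learned parameters:

```python
def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position identically.
    x: (n, d_model); W1: (d_model, d_ff); W2: (d_ff, d_model)."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU, inner dimension d_ff = 2048
    return hidden @ W2 + b2
```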

Embeddings and Softmax

Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension $d_{model}$. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation.
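A hedged PyTorch sketch of this weight sharing (a simplified stand-in, not the paper's full implementation):

```python
import torch.nn as nn

class TiedEmbeddingSoftmax(nn.Module):
    """Shares one weight matrix between the token embedding and the
    pre-softmax linear projection."""
    def __init__(self, vocab_size, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.proj = nn.Linear(d_model, vocab_size, bias=False)
        self.proj.weight = self.embed.weight          # tie the two matrices
        self.d_model = d_model

    def embed_tokens(self, token_ids):
        # The paper additionally scales embedding weights by sqrt(d_model).
        return self.embed(token_ids) * (self.d_model ** 0.5)

    def logits(self, decoder_output):
        return self.proj(decoder_output)              # softmax applied afterwards
```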

Positional Encoding

Since the Transformer contains no recurrence and no convolution, position information must be injected into the sequence in another way. The model adds sinusoidal positional encodings to the input embeddings at the bottoms of the encoder and decoder stacks:

$$PE_{(pos, 2i)} = \sin\!\left(pos / 10000^{2i/d_{model}}\right), \qquad PE_{(pos, 2i+1)} = \cos\!\left(pos / 10000^{2i/d_{model}}\right)$$

Here $pos$ is the position of the word in the sequence and $i$ indexes the dimension; $2i$ and $2i+1$ pick out the even and odd dimension indices, respectively.
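A minimal NumPy sketch of these sinusoidal encodings:

```python
import numpy as np

def positional_encoding(max_len, d_model=512):
    """PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
       PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))"""
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices 2i
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions: cosine
    return pe
```

Because each dimension corresponds to a sinusoid with a geometrically increasing wavelength, $PE_{pos+k}$ can be written as a linear function of $PE_{pos}$, which the authors hypothesize makes it easy for the model to attend by relative positions.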

Why Self-Attention

The authors compare self-attention to recurrent and convolutional layers along three criteria:

  • Total computational complexity per layer
  • The amount of computation that can be parallelized, measured by the minimum number of sequential operations required
  • The path length between long-range dependencies in the network

(Table 1 in the paper: maximum path lengths, per-layer complexity, and minimum number of sequential operations for different layer types.)

  Layer Type                    Complexity per Layer   Sequential Operations   Maximum Path Length
  Self-Attention                O(n^2 · d)             O(1)                    O(1)
  Recurrent                     O(n · d^2)             O(n)                    O(n)
  Convolutional                 O(k · n · d^2)         O(1)                    O(log_k(n))
  Self-Attention (restricted)   O(r · n · d)           O(1)                    O(n / r)

The paper: Attention Is All You Need (Vaswani et al., 2017)

Attention-mechanism knowledge in the nndl book (written by Professor Xipeng Qiu)


If you like this blog or find it useful, you are welcome to comment on it. You are also welcome to share this blog so that more people can read it. If any images used in this blog infringe your copyright, please contact the author to have them deleted. Thank you!